Distributed High-performance Web Crawlers: A Survey of the State of the Art

Author

  • Dustin Boswell
Abstract

Web Crawlers (also called Web Spiders or Robots) are programs used to download documents from the internet. Simple crawlers can be used by individuals to copy an entire web site to their hard drive for local viewing. For such small-scale tasks, numerous utilities like wget exist. In fact, an entire web crawler can be written in 20 lines of Python code. Indeed, the task is inherently simple: the general algorithm is shown in Figure 1. However, if one needs a large portion of the web (e.g., Google currently indexes over 3 billion web pages), the task becomes astoundingly difficult.
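Figure 1 is not reproduced on this page, but the general algorithm it refers to — take a URL from the frontier, download the page, extract its links, and enqueue the unseen ones — can indeed be sketched in roughly twenty lines of Python. The seed URL, page limit, and use of the standard urllib and html.parser modules below are illustrative assumptions, not details from the paper:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def crawl(seed, max_pages=100):
    frontier, seen = [seed], {seed}              # URLs to visit / already queued
    while frontier and len(seen) <= max_pages:
        url = frontier.pop(0)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                             # skip unreachable or non-text pages
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)        # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        yield url, html                          # hand the page to the caller, e.g. for storage

# Example: walk the first few reachable pages of a (hypothetical) site.
for page_url, page_html in crawl("http://example.com/", max_pages=10):
    print(page_url, len(page_html))

The difficulty the abstract alludes to comes from scaling this loop: at billions of pages, the frontier, the seen-set, politeness, and fault tolerance all stop fitting on one machine.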


Similar resources

Improving the performance of focused web crawlers

This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...
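As a rough illustration of the content-based relevance estimate such focused crawlers rely on (a generic sketch, not the learning approach proposed in this work), a page can be scored by how much of its text consists of topic terms; the topic set below is an assumption made for the example:

import re
from collections import Counter

def relevance(page_text, topic_terms):
    """Crude content-based relevance: share of page words that are topic terms.

    A stand-in for the classifiers focused crawlers use; real systems also
    score anchor text, link context, and the path that led to the page.
    """
    words = Counter(re.findall(r"[a-z]+", page_text.lower()))
    total = sum(words.values()) or 1
    return sum(words[t] for t in topic_terms) / total

# Pages scoring above some threshold get their out-links enqueued first.
topic = {"crawler", "frontier", "indexing", "spider"}
print(relevance("A web crawler maintains a frontier of URLs to visit ...", topic))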

Full text

Load Balancing Approaches for Web Servers: A Survey of Recent Trends

Numerous works have been done on load balancing of web servers in grid environments. The reason behind the popularity of the grid environment is that it allows access to distributed resources located at remote sites. For effective utilization, load must be balanced among all resources. The importance of load balancing is discussed by contrasting the system without load balancing and with loa...
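For concreteness, the two baseline dispatch policies such surveys usually start from are round-robin and least-connections. The sketch below is a generic illustration of those policies with made-up server names, not a technique taken from this particular survey:

import itertools

class RoundRobinBalancer:
    """Cycles through the backend servers in a fixed order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest open connections."""
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}
    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server
    def release(self, server):
        self.active[server] -= 1

lb = LeastConnectionsBalancer(["web1:80", "web2:80", "web3:80"])
target = lb.pick()   # forward the request to `target`; call lb.release(target) when it completes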

Full text

Emergent System for Information Retrieval

Stand-alone as well as distributed web crawlers employ high-performance, sophisticated algorithms which, in turn, require a high degree of computational power. They also use complex interprocess communication techniques (multithreading, shared memory, etc.). As opposed to distributed web crawlers, the ERRIE crawler system presented in this paper displays emergent behavior by employ...
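The conventional machinery the abstract contrasts ERRIE with (worker threads coordinating through a shared frontier queue) can be sketched as follows; this is a generic multithreaded-crawler illustration, not ERRIE's emergent design, and the seed URL and pool size are arbitrary:

import queue
import threading
from urllib.request import urlopen

frontier = queue.Queue()                 # thread-safe shared frontier
frontier.put("http://example.com/")      # illustrative seed

def worker():
    while True:
        url = frontier.get()
        try:
            page = urlopen(url, timeout=10).read()
            # ...parse `page` and put newly discovered URLs back on `frontier`...
        except Exception:
            pass
        finally:
            frontier.task_done()

for _ in range(8):                       # a small pool of downloader threads
    threading.Thread(target=worker, daemon=True).start()

frontier.join()                          # block until every queued URL has been processed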

Full text

Design and Implementation of a High-Performance Distributed Web Crawler

Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits m...
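One design decision such a distributed crawler must make is how URLs are partitioned across crawling nodes so that each host is handled by exactly one node, which keeps politeness delays and duplicate detection local. The host-hashing scheme below is a common generic approach, shown as a sketch under those assumptions rather than the design from this paper:

import hashlib
from urllib.parse import urlsplit

def assign_node(url, num_nodes):
    """Map a URL to a crawler node by hashing its host name.

    Because every URL of a host lands on the same node, per-host politeness
    delays and duplicate checks need no cross-node coordination.
    """
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# Both URLs of the same host map to the same node.
assert assign_node("http://example.com/a", 4) == assign_node("http://example.com/b", 4)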

Full text

HF-Blocker: Detection of Distributed Denial of Service Attacks Based On Botnets

Abstract—Today, botnets have become a serious threat to enterprise networks. By creating networks of bots, they launch several kinds of attacks; distributed denial-of-service (DDoS) attacks on networks are one example. Such attacks, by occupying system resources, have proven to be an effective method of denying network services. Botnets that launch HTTP packet flood attacks agains...
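Independently of the detection approach proposed here, a common first line of defense against HTTP flood traffic is per-client rate limiting. The token-bucket sketch below is a generic illustration with made-up thresholds and an example client address, not HF-Blocker's method:

import time

class TokenBucket:
    """Allows roughly `rate` requests per second per client, with bursts up to `burst`."""
    def __init__(self, rate=10.0, burst=20):
        self.rate, self.burst = rate, burst
        self.state = {}                          # client -> (tokens, last_refill_time)

    def allow(self, client_ip):
        tokens, last = self.state.get(client_ip, (float(self.burst), time.monotonic()))
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.state[client_ip] = (tokens, now)
            return False                         # over the limit: drop or challenge the request
        self.state[client_ip] = (tokens - 1.0, now)
        return True

limiter = TokenBucket(rate=10.0, burst=20)
if not limiter.allow("203.0.113.7"):
    pass                                         # e.g. respond with HTTP 429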

Full text



Publication date: 2003